Introduction to Open Data Science - Course Project

About the project

This is my course project for Introduction to Open Data Science.

I heard about this course in one of the emails for doctoral students. I’m hoping to:

Here is the link to my Github repository.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Thu May 06 22:19:23 2021"

Regression and model validation

date()
## [1] "Thu May 06 22:19:23 2021"

This chapter describes an application of a regression model.

Data

The data analyzed in this chapter is a subset of the International Survey of Approaches to Learning data set, taken from here. Variable descriptions and other metadata of this data set can be found by following this link.

The subset analyzed in this chapter can be found on my GitHub page along with the R code used to derive it.

Structure of the data

The first task is to read the data and check its structure. This is done using the R code below.

# Reading the data file
learning2014 <- read.table("data/learning2014.csv", sep=",", header=TRUE)

# Looking at the dimensions and structure of the data
dim(learning2014)
## [1] 166   7
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The data consists of 166 observations of 7 variables. These variables are:

  1. gender (character type variable);
  2. age (integer type variable);
  3. attitude (numeric type variable describing the student's attitude towards statistics);
  4. deep (numeric type variable describing questions related to deep learning);
  5. stra (numeric type variable describing questions related to strategic learning);
  6. surf (numeric type variable describing questions related to surface learning);
  7. points (integer type variable describing exam points).

Graphical overview and summary of the data

The second task is to show a graphical overview of the data and show summaries of the variables.

The ggplot2 library will be used for drawing plots.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.3

A summary of the variables can be produced using the summary function below.

# Looking at summary statistics of the variables
summary(learning2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

The summary table shows the minimum, maximum, mean, median, and 1st and 3rd quartile values of each numeric variable. The age of respondents is between 17 and 55 years with a median of 22 years. Exam points vary from 7 to 33 with a median of 23 points. The attitude and learning strategy variables all average around 3.

In the next step, various plots are used for visualizing the variables and their relationships.

# Making a table of how many responses are from male and how many from female
table_gender <- table(learning2014$gender)

# Visualizing result with a simple barplot

ggplot(data=as.data.frame(table_gender), aes(x=Var1,y=Freq)) +
  geom_bar(stat="identity", fill="darkblue")+ theme_minimal()+
  labs(x = "Gender", y = 'Count')+ ggtitle("Students' gender")

The number of female respondents is about double the number of male respondents.

# Plotting a histogram of the age variable
ggplot(learning2014, aes(x=age)) +theme_minimal() + geom_histogram(color="darkblue", fill="lightblue", binwidth=5) + ggtitle("Histogram of students' age")

Most of the respondents are around 20 years old.

# Plotting a histogram of the attitude variable with a normal curve overlaid
ggplot(learning2014, aes(x=attitude)) +theme_minimal() + geom_histogram(color="darkblue", fill="lightblue", binwidth=0.5) + ggtitle("Histogram of students' attitude")+ stat_function(fun = function(x) dnorm(x, mean = mean(learning2014$attitude), sd = sd(learning2014$attitude)) * 0.5 * nrow(learning2014))

Students' attitude seems to resemble a normal distribution.

# Visualizing attitude vs points, coloring the data points by gender and adding regression lines
ggplot(learning2014, aes(x = attitude, y = points, col = gender))+ geom_point()+geom_smooth(method = "lm")+ ggtitle("Student's attitude versus exam points") +theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

From the last plot, it can be seen that exam points tend to increase with better attitude.

# Drawing a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(learning2014[-1])

Pairwise scatter plots of the variables do not show very clear linear relationships. The most noticeable linear relationship seems to be between exam points and attitude, as in the previous plot.

Fitting the regression model

The third and fourth tasks are to fit the regression model and explain it.

Linear regression is an approach for modeling the relationship between a dependent variable \(y\) and one or more explanatory variables \(X\).

I chose the attitude, deep and strategic learning variables as explanatory variables for the dependent variable points. In this case the model has the form:

\(points = x_0 + x_1*attitude + x_2 * deep + x_3* stra + e,\) where \(e\) is an unobservable random variable which is assumed to add noise to the observations.

In the code below a linear model is fitted to estimate the coefficients \(x_0, x_1, x_2, x_3\).

# Fitting a linear model
my_model <- lm(points ~ attitude + deep + stra, data = learning2014)

# Printing out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + deep + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## deep         -0.7492     0.7507  -0.998  0.31974    
## stra          0.9621     0.5367   1.793  0.07489 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

When the p-value is smaller than the chosen significance level, we can reject the null hypothesis that the coefficient of the variable is equal to 0; in other words, the variable adds significant benefit to the model. In the model above, only the intercept and attitude have p-values smaller than the significance level 0.05. Therefore, I removed the deep and stra variables and built a second model including only the attitude variable, which has a significant relationship with the target variable.

# Fitting second linear model
my_model2 <- lm(points ~ attitude, data = learning2014)

# Printing out a summary of the model
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
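The two nested models can also be compared directly with an F-test; a minimal sketch, assuming my_model and my_model2 fitted as above:

```r
# Compare the reduced model (attitude only) to the full model
# (attitude + deep + stra); a small p-value would favor the full model
anova(my_model2, my_model)
```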

Below is the code for a fitted line plot, which shows a scatterplot of the data with a line representing the regression equation.

#Plotting the regression line by the model
ggplot(learning2014, aes(x = attitude, y = points)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red") +ggtitle("Fitted line plot") +theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

The resulting model has the following form:

\(points = 11.64 + 3.53*attitude\).

The attitude coefficient in the regression equation is 3.53. This coefficient represents the mean increase in exam points for each additional unit of the attitude measurement: if the attitude value increases by 1, the predicted exam points increase by 3.53.
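The slope interpretation can be checked numerically; a small sketch, assuming my_model2 fitted above:

```r
b <- coef(my_model2)
# predicted exam points at attitude = 3 and attitude = 4
p3 <- b[1] + b[2] * 3
p4 <- b[1] + b[2] * 4
# the difference equals the attitude coefficient, about 3.53
unname(p4 - p3)
```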

R-squared is a measure of how well a linear regression model fits the data. It indicates the percentage of the variance in the dependent variable that the independent variables explain collectively. Here the only independent variable is attitude, and based on the R-squared value in the summary table it explains about 19% of the variance in the points variable. When the deep and stra variables are included, R-squared is slightly larger, showing that about 21% of the variability in points is explained.
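In simple linear regression, R-squared equals the squared correlation between the observed and fitted values; a quick check, assuming my_model2 from above:

```r
# should reproduce the Multiple R-squared of the summary, about 0.19
cor(learning2014$points, fitted(my_model2))^2
```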

Diagnostic plots

The last task is to produce diagnostic plots.

How well the model describes the phenomenon depends on how well the model assumptions fit reality. In a linear regression model the main assumptions are linearity and normally distributed errors. Analyzing the residuals of the model provides a way to check the validity of these assumptions. Diagnostic plots are a good tool for this analysis: they allow exploring the normality of the errors, the constancy of the error variance, and the leverage of individual observations.

Below is the code for diagnostic plots.

# Drawing the diagnostic plots.
par(mfrow = c(2,2))
plot(my_model2, which = c(1,2,5))

The first plot, residuals versus fitted values, shows no visible pattern, so the assumption of constant error variance is not violated.

The second plot, the normal Q-Q plot, shows a reasonable fit to the line, which suggests that the distribution of the errors is close to normal.

The third plot, residuals versus leverage, shows how much impact a single observation has on the model. None of the observations appears to have an unusually high impact.

Conclusion

Even though there seems to be a significant linear relationship between attitude and exam points, the attitude variable explains only about 19% of the variability of exam points. Therefore, the model does not fit the data very well.


Logistic regression

date()
## [1] "Thu May 06 22:19:29 2021"

This chapter describes an application of a logistic regression model.

Data

The data of this chapter is the student alcohol consumption data from the UCI Machine Learning Repository. It has been pre-processed using the code in my GitHub repository. The data set consists of 370 observations of 35 variables. Information about the variables can be found here. An additional alc_use variable was defined as the average of weekday and weekend alcohol use, and an additional high_use variable was defined as TRUE if alc_use is higher than 2 and FALSE otherwise.
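The derivation of the two added variables can be sketched as follows (Dalc and Walc are the weekday and weekend alcohol use columns of the original data):

```r
# average of weekday (Dalc) and weekend (Walc) alcohol use
alc$alc_use <- (alc$Dalc + alc$Walc) / 2

# TRUE when average consumption exceeds 2
alc$high_use <- alc$alc_use > 2
```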

# Reading the data
alc <- read.csv('data/alc.csv')

# Printing out the names of the variables in the data 
names(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
# Looking at the dimensions of the data
dim(alc)
## [1] 370  35
# Looking at the structure of the data
str(alc)
## 'data.frame':    370 obs. of  35 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : chr  "R" "R" "R" "R" ...
##  $ famsize   : chr  "GT3" "GT3" "GT3" "GT3" ...
##  $ Pstatus   : chr  "T" "T" "T" "T" ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : chr  "at_home" "other" "at_home" "services" ...
##  $ Fjob      : chr  "other" "other" "other" "health" ...
##  $ reason    : chr  "home" "reputation" "reputation" "course" ...
##  $ guardian  : chr  "mother" "mother" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : chr  "yes" "yes" "yes" "yes" ...
##  $ famsup    : chr  "yes" "yes" "yes" "yes" ...
##  $ activities: chr  "yes" "no" "yes" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "yes" "yes" "no" "yes" ...
##  $ romantic  : chr  "no" "yes" "no" "no" ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ paid      : chr  "yes" "no" "no" "no" ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ alc_use   : int  1 3 1 1 2 1 2 2 2 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...

Interesting variables and hypothesis

I think that these four variables could be related to alcohol consumption:

  1. sex: males might have a tendency to drink more;
  2. goout: students who go out often may go for drinks with friends;
  3. famrel: bad family relations could be a cause of drinking;
  4. studytime: maybe more motivated students who spend more time studying are less likely to engage in drinking.

Distributions of the chosen variables

# Accessing the tidyverse libraries dplyr and ggplot2
library(dplyr) 
## Warning: package 'dplyr' was built under R version 4.0.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Let's look at the distribution of the sex variable.

#Producing summary statistics of alcohol use
alc %>% group_by(sex, high_use) %>% summarise(count = n(),mean_alc_use = mean(alc_use))
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_alc_use
##   <chr> <lgl>    <int>        <dbl>
## 1 F     FALSE      154         1.44
## 2 F     TRUE        41         2.54
## 3 M     FALSE      105         1.50
## 4 M     TRUE        70         3.36

It seems that more males have high alcohol consumption compared to females. My initial hypothesis was right.

Let's look at the distribution of the goout variable.

# Initializing a plot of alc_use and going out
g2 <- ggplot(alc, aes(y = alc_use, x = factor(goout), col = sex))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student tendency to go out by alcohol consumption and sex") + theme_classic() + xlab('Going out') + ylab('Alcohol use')

It seems that students who go out more tend to drink more alcohol, so my initial hypothesis was correct. Interestingly, for males the median alcohol consumption increases drastically if they go out with friends a lot, whereas for females the median does not change between moderate and high amounts of going out.

Let's look at the famrel variable.

# Initializing a plot of alc_use and family relationships
g2 <- ggplot(alc, aes(x = factor(famrel), y = alc_use, col = sex))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student's family relationships by alcohol consumption and sex") + xlab('Family relationships') + ylab('Alcohol consumption') + theme_light()

My initial hypothesis that family relationships are related to alcohol consumption appears correct. For males it is clearly visible that the worse the family relationships, the higher the alcohol consumption. For females the median alcohol consumption is similar across levels, except that it drops when family relationships are very good.

Let's look at the studytime variable.

# Initializing a plot of alc_use and study time
g2 <- ggplot(alc, aes(x = factor(studytime), y = alc_use, col = sex))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student's study time by alcohol consumption and sex") + theme_dark() + xlab('Study time') + ylab('Alcohol use')

My initial hypothesis was correct: for both males and females, alcohol consumption decreases as study time increases.

Logistic regression model

In this section a logistic regression model is fitted to explore the relationship between my chosen variables and the binary high/low alcohol consumption variable as the target.

# Subsetting initial data frame to include only chosen variables and converting them to factors
alc_subset <- alc %>% select(high_use,sex,goout,famrel,studytime) %>% mutate(high_use = factor(high_use),
sex = factor(sex),
goout = factor(goout),
famrel = factor(famrel),
studytime = factor(studytime))
str(alc_subset)
## 'data.frame':    370 obs. of  5 variables:
##  $ high_use : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 1 2 1 1 1 2 1 ...
##  $ sex      : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ goout    : Factor w/ 5 levels "1","2","3","4",..: 2 4 1 2 1 2 2 3 2 3 ...
##  $ famrel   : Factor w/ 5 levels "1","2","3","4",..: 3 3 4 4 4 4 4 4 4 4 ...
##  $ studytime: Factor w/ 4 levels "1","2","3","4": 4 2 1 3 3 3 3 2 4 4 ...
# Finding the model with glm()
m <- glm(high_use ~ sex + goout + famrel + studytime, data = alc_subset, family = "binomial")

# Printing out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + goout + famrel + studytime, family = "binomial", 
##     data = alc_subset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6339  -0.7259  -0.4807   0.7514   2.5431  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.62942    1.09247  -1.491 0.135833    
## sexM         0.87854    0.27963   3.142 0.001679 ** 
## goout2       0.24560    0.69914   0.351 0.725369    
## goout3       0.59346    0.68118   0.871 0.383636    
## goout4       2.03442    0.68010   2.991 0.002778 ** 
## goout5       2.47951    0.70339   3.525 0.000423 ***
## famrel2      0.09062    1.05603   0.086 0.931619    
## famrel3      0.17934    0.94055   0.191 0.848782    
## famrel4     -0.51821    0.91743  -0.565 0.572171    
## famrel5     -1.04992    0.94231  -1.114 0.265193    
## studytime2  -0.34253    0.30894  -1.109 0.267537    
## studytime3  -0.89382    0.48761  -1.833 0.066796 .  
## studytime4  -1.29146    0.65058  -1.985 0.047136 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 363.90  on 357  degrees of freedom
## AIC: 389.9
## 
## Number of Fisher Scoring iterations: 5

As all variables in this model are factors, each is converted into so-called dummy variables. The model summary shows that famrel is the only variable for which none of the dummy-coded levels has a statistically significant relationship with high alcohol consumption; every other variable has at least one statistically significant dummy level.
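The dummy coding can be inspected directly; a sketch, assuming alc_subset as defined above:

```r
# model.matrix() shows how each factor is expanded into 0/1 dummy columns,
# with the first level of each factor used as the reference category
head(model.matrix(high_use ~ sex + goout + famrel + studytime, data = alc_subset))
```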

# Printing out the coefficients of the model
coef(m)
## (Intercept)        sexM      goout2      goout3      goout4      goout5 
## -1.62941587  0.87853522  0.24560330  0.59345587  2.03442345  2.47950909 
##     famrel2     famrel3     famrel4     famrel5  studytime2  studytime3 
##  0.09061581  0.17933751 -0.51821396 -1.04991854 -0.34253287 -0.89381785 
##  studytime4 
## -1.29145540

The positive sexM coefficient means that the log-odds of high alcohol consumption increase by 0.88 when the person is male. The positive goout2, goout3, goout4 and goout5 coefficients mean that the log-odds of high alcohol consumption increase with going-out frequency, by 0.25, 0.59, 2.03 and 2.48 respectively, relative to the lowest level. For family relationships, the log-odds of high alcohol consumption decrease when relationships are good and increase otherwise. The more study time increases, the more the log-odds of high alcohol consumption decrease. The model thus seems to confirm my earlier hypotheses.

# Computing odds ratios (OR)
OR <- coef(m) %>% exp

# Computing confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# Printing out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR      2.5 %     97.5 %
## (Intercept)  0.1960441 0.01756286  1.4112813
## sexM         2.4073708 1.39914040  4.1989737
## goout2       1.2783923 0.35906327  6.0626843
## goout3       1.8102336 0.53279909  8.3774038
## goout4       7.6478415 2.26873938 35.4135478
## goout5      11.9354038 3.36543311 57.2822325
## famrel2      1.0948483 0.15189997 10.7759938
## famrel3      1.1964245 0.21464931  9.9233767
## famrel4      0.5955833 0.11227629  4.7645467
## famrel5      0.3499663 0.06201071  2.8900970
## studytime2   0.7099698 0.38706568  1.3037798
## studytime3   0.4090909 0.15006032  1.0324002
## studytime4   0.2748704 0.06709430  0.9039728

sexM, all goout dummy variables, and famrel2 and famrel3 have OR > 1, which means greater odds of high alcohol consumption compared to the reference level. However, the confidence intervals of the famrel dummies and the lowest goout levels include 1, so those associations are not statistically significant.
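The odds ratios are simply the exponentials of the model coefficients; for example, for the sexM coefficient reported above:

```r
# exponentiating the log-odds coefficient gives the odds ratio
exp(0.87854)  # about 2.41, matching the sexM odds ratio in the table
```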

Predictive power of the model

As the famrel variable does not show statistical significance, let's exclude it from the model.

# Fitting the model with glm()
m <- glm(high_use ~ sex + goout + studytime, data = alc_subset, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# Adding the predicted probabilities to 'alc_subset'
alc_subset <- mutate(alc_subset, probability = probabilities)

# Using the probabilities to make a prediction of high_use
alc_subset <- mutate(alc_subset, prediction = probability > 0.5)

# Looking at the last ten original classes, predicted probabilities, and class predictions
tail(alc_subset, 10)
##     high_use sex goout famrel studytime probability prediction
## 361     TRUE   M     3      5         1   0.3531731      FALSE
## 362     TRUE   M     3      4         1   0.3531731      FALSE
## 363     TRUE   M     1      4         1   0.2321607      FALSE
## 364     TRUE   M     4      3         2   0.5923356       TRUE
## 365    FALSE   M     2      3         1   0.2877723      FALSE
## 366     TRUE   M     3      4         1   0.3531731      FALSE
## 367    FALSE   M     2      4         3   0.1213033      FALSE
## 368     TRUE   M     4      4         1   0.6891062       TRUE
## 369     TRUE   M     4      5         2   0.5923356       TRUE
## 370    FALSE   M     2      4         1   0.2877723      FALSE
# Tabulating the target variable versus the predictions
table(high_use = alc_subset$high_use, prediction = alc_subset$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   240   19
##    TRUE     58   53

The table of the target variable versus the predictions shows that most students with low alcohol consumption are predicted correctly, whereas only about half of the students with high alcohol consumption are predicted correctly.
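The overall accuracy can be read off the confusion table above: (240 + 53) correct predictions out of 370 students.

```r
# proportion of correct predictions (true negatives + true positives)
(240 + 53) / 370  # overall accuracy, about 0.79
```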

# Initializing a plot of 'high_use' versus 'probability' in 'alc_subset'
g <- ggplot(alc_subset, aes(x = probability, y = high_use, col = prediction))

# Define the geom as points and draw the plot
g + geom_point()

The plot illustrates that many of the students with high alcohol consumption are not predicted correctly.

# Define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Calling loss_func to compute the average number of wrong predictions in the (training) data
# alc_subset$high_use is a factor, therefore it needs to be converted into numeric vector.
# 1 is subtracted because as.numeric() converts TRUE values to 2 and FALSE values to 1
loss_func(class = as.numeric(alc_subset$high_use)-1,prob = alc_subset$probability)
## [1] 0.2081081

Now let's see whether the model predicts better than random guessing.

# Instead of probabilities predicted by the model, random sample of 0's and 1's is generated by sample() function
loss_func(class = as.numeric(alc_subset$high_use)-1,prob = sample(0:1, nrow(alc_subset), replace=T))
## [1] 0.4864865

The logistic regression model has an error of about 21%, whereas random guessing gives an error of around 50%, which means that the logistic regression model is reasonably informative.
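The behaviour of loss_func can be illustrated with a tiny hand-made example (a hypothetical vector of four observations):

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# two of the four probabilities fall on the wrong side of 0.5,
# so half of the predictions are counted as wrong
loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.2, 0.4, 0.6))  # 0.5
```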

Cross-validation

Let's perform 10-fold cross-validation.

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc_subset, cost = loss_func, glmfit = m, K = 10)

# Average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.227027

My model has a slightly smaller prediction error (about 0.23) in 10-fold cross-validation than the model introduced in DataCamp (which had an error of about 0.26).


Clustering and classification

date()
## [1] "Thu May 06 22:19:32 2021"

Data

In this chapter the Boston data from the MASS package is analyzed.

Let’s load and explore the data.

# Accessing the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# Loading the Boston data
data("Boston")


# Looking at the structure of the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim (Boston)
## [1] 506  14

This data frame contains 506 observations of 14 variables. These variables are:

crim: per capita crime rate by town.

zn: proportion of residential land zoned for lots over 25,000 sq.ft.

indus: proportion of non-retail business acres per town.

chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox: nitrogen oxides concentration (parts per 10 million).

rm: average number of rooms per dwelling.

age: proportion of owner-occupied units built prior to 1940.

dis: weighted mean of distances to five Boston employment centres.

rad: index of accessibility to radial highways.

tax: full-value property-tax rate per $10,000.

ptratio: pupil-teacher ratio by town.

black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

medv: median value of owner-occupied homes in $1000s.

lstat: lower status of the population (percent).

Graphical overview of the data

Let’s look at the summaries of the variables in the data.

# Looking at summary statistics
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

All of the variables have different scales. For example, rm varies from about 3.6 to 8.8, while tax varies from 187 to 711. This shows that the data will need to be standardized before clustering, because clustering models are distance-based algorithms, so features with larger ranges would otherwise have a bigger influence on the clustering than features with smaller ranges.

Let’s look at the pairwise relationships between the variables.

# Plotting a matrix of the variables
pairs(Boston)

There are some visible linear relationships between the variables. For example, when medv (median value of owner-occupied homes in $1000s) increases, rm (average number of rooms per dwelling) seems to increase as well.

Let’s look at the correlations.

# Accessing corrplot and tidyr libraries 
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.3
## corrplot 0.84 loaded
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.0.3
# Loading package for color palletes
library(wesanderson)
## Warning: package 'wesanderson' was built under R version 4.0.3
# Calculating the correlation matrix and rounding it
cor_matrix <- cor(Boston) %>% round(digits = 2)

# Printing the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# Visualizing the correlation matrix
corrplot.mixed(cor_matrix, upper.col = wes_palette("Rushmore1", 100, type = "continuous"), lower.col = wes_palette("Rushmore1", 100, type = "continuous"), tl.col= 'black', number.cex = .45, tl.pos = "d", tl.cex = 0.6 )

The highest positive correlation (0.91) is between the tax and rad variables. The highest negative correlation (-0.77) is between the nox and dis variables.
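
A sketch of how the extreme correlations could be picked out programmatically instead of by eye (recomputing the matrix from the Boston data rather than reusing cor_matrix, so the snippet is self-contained):

```r
# Accessing the Boston data set
library(MASS)

cm <- cor(Boston)
cm[lower.tri(cm, diag = TRUE)] <- NA   # keep each variable pair only once

# Flatten to a (Var1, Var2, Freq) table and sort by absolute correlation
pairs_df <- na.omit(as.data.frame(as.table(cm)))
pairs_df[order(-abs(pairs_df$Freq)), ][1:5, ]
```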

Standardizing the data

Let’s standardize the data by subtracting the column means from the corresponding columns and dividing the differences by the standard deviations.

# Scaling the variables
boston_scaled <- scale(Boston)

# Printing summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

As scale() returns a matrix, the scaled data needs to be converted into a data frame object.

# Change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
class(boston_scaled)
## [1] "data.frame"

Let’s create a categorical variable of the crime rate in the Boston dataset.

# Using quantiles as break points 
break_points <- quantile(boston_scaled$crim)
break_points
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# Creating a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = break_points, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
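
Because the break points are quantiles, the cut should yield four classes of roughly equal size; this can be checked with table(). A self-contained sketch that repeats the scaling and cutting steps:

```r
library(MASS)

# Scale crim and cut it at its quartiles, as above
crim_scaled <- as.vector(scale(Boston$crim))
bp <- quantile(crim_scaled)
crime <- cut(crim_scaled, breaks = bp, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# Four groups of (near-)equal size, about 506 / 4 observations each
table(crime)
```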

Let’s drop the old crime rate variable from the dataset.

# Removing original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# Adding the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Let’s divide data into train and test sets.

# Setting the number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# Choosing randomly 80% of the rows
set.seed(123)
ind <- sample(n,  size = n * 0.8)

# Creating a train set

train <- boston_scaled[ind,]

# Creating a test set 
test <- boston_scaled[-ind,]

Fitting the linear discriminant analysis

In this section LDA will be fitted using the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.

# Fitting LDA
lda.fit <- lda(crime ~ ., data = train)

# Printing the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2500000 0.2500000 0.2450495 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       1.01866313 -0.9066422 -0.08120770 -0.8835724  0.38666334 -0.9213895
## med_low  -0.07141687 -0.3429155  0.03952046 -0.5768545 -0.09882533 -0.3254849
## med_high -0.40148591  0.2162741  0.19544522  0.3637157  0.12480815  0.4564260
## high     -0.48724019  1.0149946 -0.03371693  1.0481437 -0.47733231  0.8274496
##                 dis        rad        tax    ptratio      black        lstat
## low       0.9094324 -0.6819317 -0.7458486 -0.4234280  0.3729083 -0.766883775
## med_low   0.3694282 -0.5395408 -0.4205644 -0.1079710  0.3103406 -0.164921798
## med_high -0.3720478 -0.4349280 -0.3191090 -0.2716959  0.1049654  0.009720124
## high     -0.8601246  1.6596029  1.5294129  0.8057784 -0.6383964  0.900379309
##                 medv
## low       0.47009410
## med_low   0.01548761
## med_high  0.19440747
## high     -0.71294711
## 
## Coefficients of linear discriminants:
##                  LD1         LD2        LD3
## zn       0.148390645  0.74870532 -1.0874785
## indus    0.040971465 -0.38126652 -0.1619456
## chas     0.002460776  0.03963849  0.1699312
## nox      0.312458150 -0.67564471 -1.3104018
## rm       0.011086947 -0.16058718 -0.1572603
## age      0.283482486 -0.38817624 -0.1013491
## dis     -0.079848789 -0.38493775  0.2108038
## rad      3.718978412  0.93123532 -0.4706522
## tax     -0.015197127  0.04230505  1.2889075
## ptratio  0.180294774 -0.01592588 -0.3558490
## black   -0.136724112  0.02948075  0.1288959
## lstat    0.145739238 -0.37823065  0.3345688
## medv     0.061327205 -0.53906503 -0.1509890
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9523 0.0364 0.0113
# The function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# Setting target classes as numeric
classes <- as.numeric(train$crime)

# Plotting the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

The LDA biplot shows that the rad variable is the most influential linear separator for discriminating high crime rates, although some medium-high-rate points are mixed into that group.

Making predictions with LDA model

Let’s save the crime categories from the test set and then remove it from the test dataset.

# Saving the correct classes from test data
correct_classes <- test$crime

# Removing the crime variable from test data
test <- dplyr::select(test, -crime)

Now let’s make predictions with the LDA model on the test set and cross tabulate the results.

# Predicting the classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# Cross tabulating the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       10      10        4    0
##   med_low    2      17        6    0
##   med_high   1       9       13    2
##   high       0       0        1   27

The model predicts high crime rates best: 27 of the 29 observations predicted as high are truly high, and only one actual high-rate observation is misclassified. The model has a harder time separating the low, medium low and medium high crime rates from each other.
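
The overall accuracy follows from the cross tabulation, since correct predictions sit on the diagonal. A sketch re-entering the table above by hand:

```r
# Confusion matrix copied from the cross tabulation above
conf <- matrix(c(10, 10,  4,  0,
                  2, 17,  6,  0,
                  1,  9, 13,  2,
                  0,  0,  1, 27),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# Overall accuracy: sum of the diagonal over the total (67/102, about 0.66)
sum(diag(conf)) / sum(conf)
```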

Clustering

Let’s reload and scale Boston data set.

# Scaling the variables in Boston data set
boston_scaled2 <- scale(Boston)

Let’s calculate the Euclidean distances between the observations.

# Distances between the observations
dist_eu <- dist(boston_scaled2)

Let’s run k-means algorithm on the dataset choosing 2 clusters.

# k-means clustering
km <-kmeans(boston_scaled2, centers = 2)

# Plotting the Boston dataset with clusters
pairs(Boston, col = km$cluster)

Let’s investigate what is the optimal number of clusters and run the algorithm again.

library(ggplot2)
# Setting a seed so that the initial clusters are assigned the same way every run
set.seed(123)

# Determining the number of clusters
k_max <- 20

# Calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# Visualizing the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

It is hard to tell what the optimal number of clusters is: there is no single sharp drop, and the curve resembles exponential decay. The elbow is perhaps around 5-6 clusters.
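
One way to make the elbow easier to see is to look at the relative drop in total WCSS gained by each additional cluster. A self-contained sketch (recomputing WCSS for k up to 10, with several random starts to stabilize the curve):

```r
library(MASS)

boston_scaled2 <- scale(Boston)
set.seed(123)

k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled2, k, nstart = 10)$tot.withinss)

# Percentage decrease in WCSS from k clusters to k + 1
round(100 * (-diff(twcss)) / head(twcss, -1), 1)
```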

Let’s run kmeans again with 6 clusters.

# k-means clustering
km <-kmeans(boston_scaled2, centers = 6)

# Plotting the Boston dataset with clusters
pairs(Boston, col = km$cluster)

Plotting the clusters in the data with ggpairs.

library(GGally)
## Warning: package 'GGally' was built under R version 4.0.3
options(warn = -1)
ggpairs(Boston, mapping = aes(color = factor(km$cluster)))

Some clusters seem to be further apart from others. It is hard to draw more conclusions from such plots.
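
The km object itself is often more informative than the scatter plots: $size gives the number of observations per cluster and $centers the mean scaled feature values. A self-contained sketch repeating the clustering:

```r
library(MASS)

boston_scaled2 <- scale(Boston)
set.seed(123)
km <- kmeans(boston_scaled2, centers = 6)

km$size                                                  # observations per cluster
round(km$centers[, c("crim", "rad", "tax", "medv")], 2)  # selected cluster means
```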

LDA using the clusters as target classes (bonus)

Performing k-means on the scaled Boston data with a reasonable number of clusters.

# Scaling the variables
boston_scaled3 <- scale(Boston)

# Performing kmeans on scaled Boston data
km <-kmeans(boston_scaled3, centers = 6)
# Including clusters into scaled boston dataset

boston_clusters <- data.frame(boston_scaled3, km = km$cluster)

str(boston_clusters)
## 'data.frame':    506 obs. of  15 variables:
##  $ crim   : num  -0.419 -0.417 -0.417 -0.416 -0.412 ...
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
##  $ km     : int  2 2 6 2 6 2 2 2 2 2 ...
# Performing LDA using the clusters as target classes

lda.fit.cluster <- lda(km ~ ., data = boston_clusters)

# The function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# Setting target classes as numeric
classes <- as.numeric(boston_clusters$km)

# Plotting the lda results
plot(lda.fit.cluster, dimen = 2, col = classes, pch = classes)

# Plotting arrows representing the relationships of the original variables to the LDA solution
lda.arrows(lda.fit.cluster, myscale = 1)

There seem to be two clusters that lie at a longer distance from the rest of the clusters. That distance is mostly influenced by the rad variable.

Projection of the data points (super bonus)

model_predictors <- dplyr::select(train, -crime)
# Checking the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# Matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

#install.packages('plotly')
library(plotly)
# Plotting with color argument equal to crime classes
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
# Plotting with color argument equal to kmeans clusters of the observations in train set of LDA.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = factor(km$cluster[ind]))

These two plots are similar in that one group of points lies further apart from the rest of the data. The difference is that in the first plot the distant group is less mixed and mainly includes points of a single class.

I wonder how similar they would be if 4 cluster centers were chosen instead of 6.

# Scaling the variables
boston_scaled3 <- scale(Boston)

# Performing kmeans on scaled Boston data
km <-kmeans(boston_scaled3, centers = 4)


# Plotting with color argument equal to kmeans clusters of the observations in train set of LDA.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = factor(km$cluster[ind]))

It seems that the distant cluster still remains less homogeneous than in the first plot, where the points were colored by crime rate class.


Dimensionality reduction techniques

date()
## [1] "Thu May 06 22:20:16 2021"

Data

In this chapter the human dataset, which originates from the United Nations Development Programme, is analyzed. The code of data preparation along with the data is in my Github repository.

Let’s load and explore the data.

# Loading the data
human <- read.table('data/human.csv',sep = ',',header = T, row.names = 1)

# Looking at the structure of the dataset
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim (human)
## [1] 155   8
names(human)
## [1] "Edu2.FM"   "Labo.FM"   "Life.Exp"  "Edu.Exp"   "GNI"       "Mat.Mor"  
## [7] "Ado.Birth" "Parli.F"

The data set consists of 155 observations and 8 variables. Each observation has a name which indicates a country. The variables are:

Edu2.FM: ratio of the proportion of females with at least secondary education to the proportion of males with at least secondary education.

Labo.FM: ratio of the proportion of females in the labour force to the proportion of males in the labour force.

Life.Exp: life expectancy at birth.

Edu.Exp: expected years of schooling.

GNI: Gross National Income per capita.

Mat.Mor: maternal mortality ratio.

Ado.Birth: adolescent birth rate.

Parli.F: percentage of female representatives in parliament.

# Looking at the summary of the variables
summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

All of the variables have different scales. GNI, Mat.Mor and Ado.Birth in particular exhibit high variability.

Let’s look at the pairwise relationships between the variables.

# Plotting a matrix of the variables
pairs(human)

library(corrplot)
library(dplyr)
# Computing the correlation matrix
cor(human) %>% corrplot(type = "upper", tl.cex=0.7)

Many variables seem to be correlated. For example, when Life.Exp (life expectancy at birth) increases, Edu.Exp (expected years of schooling) and GNI increase as well. With increasing Life.Exp, Mat.Mor (maternal mortality ratio) and Ado.Birth (adolescent birth rate) decrease.

Let’s look at the distribution of the variables.

library(ggplot2)
# Plotting a histogram of the life expectancy at birth variable
ggplot(human, aes(x=Life.Exp)) +theme_minimal() + geom_histogram(color="darkblue", fill="lightblue", binwidth=5) + ggtitle("Life expectancy at birth")

library(ggplot2)
# Plotting a histogram of the expected years of schooling variable
ggplot(human, aes(x=Edu.Exp)) +theme_minimal() + geom_histogram(color="darkblue", fill="lightblue", binwidth=1) + ggtitle("Expected years of schooling")+ stat_function(fun = function(x) dnorm(x, mean = mean(human$Edu.Exp), sd = sd(human$Edu.Exp)) * 1 * nrow(human))

library(ggplot2)
# Plotting a histogram of the Mat.Mor variable
ggplot(human, aes(x=Mat.Mor)) +theme_minimal() + geom_histogram(color="darkblue", fill="lightblue", binwidth=50) + ggtitle("Maternal mortality ratio")

library(ggplot2)
# Plotting a histogram of the Ado.Birth variable
ggplot(human, aes(x=Ado.Birth)) +theme_minimal() + geom_histogram(color="darkblue", fill="lightblue", binwidth=10) + ggtitle("Adolescent birth rate")

library(ggplot2)
# Plotting a histogram of the Parli.F variable
ggplot(human, aes(x=Parli.F)) +theme_minimal() + geom_histogram(color="darkblue", fill="lightblue", binwidth=5) + ggtitle("Percentage of female representatives in parliament")

Out of all the distributions, only expected years of schooling resembles a normal distribution. The maternal mortality ratio is mostly quite low, although there are a few very high exceptions. Life expectancy has most counts between 70 and 80 years. In very few countries do females hold more than 50% of the parliament seats.

Principal Component Analysis (PCA)

In this section PCA is applied on both non-standardized and standardized data, and the results are compared at the end of the section.

Non-standardized data

options(warn = -1)
# Performing PCA (with the SVD method)
pca_human_ns <- prcomp(human)

# Creating and printing out a summary of pca_human
s <- summary(pca_human_ns)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# Rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1) 

# Printing out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
# Creating an object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# Drawing a biplot
biplot(pca_human_ns, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], sub = 'Gross National Income per capita (GNI) has the highest variability out of all features and it highly contributes to PC1 component.', cex.sub = 0.7)

Standardized data

options(warn = -1)
# Standardizing the variables
human_std <- scale(human)


# Performing PCA (with the SVD method)
pca_human <- prcomp(human_std)


# Creating and printing out a summary of pca_human
s <- summary(pca_human)
s
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# Rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1) 

# Printing out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# Creating an object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# Drawing a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], sub = 'Standardized variables contribute more equally to the principal components.', cex.sub = 0.7)

Standardization of the data set makes a very big difference. Without standardizing, the very high variability in GNI dominates the first principal component, and thus virtually all of the variance in the data is attributed to PC1. When the data is standardized, the variables contribute more equally and the first principal component explains over 50% of the variance in the data.

The angles between the feature arrows in the biplot of the standardized data are small, which indicates high positive correlation between those features.

The ratio of female to male labour force participation (Labo.FM) and the percentage of female representatives in parliament (Parli.F) are the only two variables that contribute substantially to the second principal component. These two variables explain less of the variance in the data than the education, life expectancy, maternal mortality and adolescent birth rate variables.
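
The biplot reading can be checked against the loadings directly, which prcomp() stores in the rotation matrix. Since human.csv is local to this project, the same inspection is sketched on the built-in USArrests data as a stand-in:

```r
# PCA on standardized data; rotation holds the variable loadings
pca <- prcomp(scale(USArrests))

# Columns are PCs, rows are the original variables; variables with a large
# absolute loading on PC2 but a small one on PC1 are the ones that, like
# Labo.FM and Parli.F above, define the second component
round(pca$rotation, 2)
```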

Multiple Correspondence Analysis

In this section the tea dataset from the package FactoMineR is used. It contains the answers of a questionnaire on tea consumption.

# Loading tea dataset
library(FactoMineR)
data('tea')

# Looking at the structure of the dataset
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
# Looking at the summaries of the variables
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 

This dataset has 300 observations and 36 variables. As there are many variables, I will select only some of them for further analysis:

age_Q, sex, price, sugar, frequency, and relaxing.

library(dplyr)
library(ggplot2)
library(tidyr)
# Subsetting the dataset
selected_columns <- c('age_Q', 'sex', 'price', 'sugar', 'frequency', 'relaxing')
tea_selected <- dplyr::select(tea, one_of(selected_columns))

# Visualize the selected dataset
gather(tea_selected) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

It seems that tea consumption is highest in the 15-24 age group. Cheap tea is rarely chosen. Most respondents find tea drinking relaxing, and more females than males drink tea. Tea without sugar is slightly preferred.

Let’s do the multiple correspondence analysis.

# Multiple correspondence analysis (MCA() comes from the FactoMineR package)
library(FactoMineR)
mca <- MCA(tea_selected, graph = FALSE)

# Summarizing the model
summary(mca)
## 
## Call:
## MCA(X = tea_selected, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.270   0.243   0.211   0.194   0.186   0.178   0.171
## % of var.             10.790   9.727   8.442   7.779   7.434   7.124   6.826
## Cumulative % of var.  10.790  20.517  28.958  36.737  44.171  51.295  58.121
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.165   0.149   0.144   0.142   0.122   0.117   0.113
## % of var.              6.596   5.971   5.767   5.674   4.873   4.660   4.510
## Cumulative % of var.  64.717  70.688  76.455  82.129  87.002  91.662  96.172
##                       Dim.15
## Variance               0.096
## % of var.              3.828
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1               |  0.529  0.346  0.046 |  1.139  1.778  0.211 |  0.637  0.640
## 2               | -0.786  0.764  0.336 |  0.476  0.311  0.123 |  0.604  0.576
## 3               | -0.733  0.664  0.351 | -0.281  0.108  0.052 |  0.041  0.003
## 4               |  0.701  0.607  0.320 | -0.026  0.001  0.000 |  0.131  0.027
## 5               | -0.429  0.227  0.111 |  0.093  0.012  0.005 | -0.166  0.043
## 6               |  0.557  0.383  0.090 | -0.093  0.012  0.003 |  0.478  0.360
## 7               | -0.104  0.013  0.003 |  0.331  0.150  0.035 | -0.344  0.187
## 8               | -0.020  0.000  0.000 | -0.266  0.097  0.026 | -0.273  0.118
## 9               | -0.251  0.078  0.024 |  0.125  0.021  0.006 | -0.907  1.300
## 10              | -0.251  0.078  0.024 |  0.125  0.021  0.006 | -0.907  1.300
##                   cos2  
## 1                0.066 |
## 2                0.198 |
## 3                0.001 |
## 4                0.011 |
## 5                0.017 |
## 6                0.066 |
## 7                0.037 |
## 8                0.028 |
## 9                0.318 |
## 10               0.318 |
## 
## Categories (the 10 first)
##                     Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## 15-24           |   0.504   4.818   0.112   5.799 |  -0.955  19.160   0.403
## 25-34           |   0.616   5.398   0.113   5.825 |   0.794   9.934   0.188
## 35-44           |  -0.175   0.253   0.005  -1.188 |   0.224   0.458   0.008
## 45-59           |  -0.990  12.305   0.250  -8.646 |   0.489   3.328   0.061
## +60             |  -0.567   2.515   0.047  -3.733 |  -0.150   0.196   0.003
## F               |  -0.385   5.447   0.217  -8.051 |  -0.449   8.209   0.295
## M               |   0.562   7.948   0.217   8.051 |   0.656  11.977   0.295
## p_branded       |   0.031   0.019   0.000   0.363 |  -0.283   1.740   0.037
## p_cheap         |  -0.445   0.285   0.005  -1.189 |  -0.881   1.241   0.019
## p_private label |   0.891   3.436   0.060   4.228 |  -0.239   0.275   0.004
##                  v.test     Dim.3     ctr    cos2  v.test  
## 15-24           -10.980 |   0.450   4.896   0.089   5.171 |
## 25-34             7.502 |  -0.318   1.839   0.030  -3.007 |
## 35-44             1.519 |  -0.245   0.632   0.009  -1.661 |
## 45-59             4.269 |   0.626   6.291   0.100   5.468 |
## +60              -0.989 |  -1.258  15.824   0.229  -8.283 |
## F                -9.384 |   0.232   2.523   0.079   4.846 |
## M                 9.384 |  -0.339   3.680   0.079  -4.846 |
## p_branded        -3.333 |   0.556   7.726   0.143   6.543 |
## p_cheap          -2.355 |   0.358   0.236   0.003   0.957 |
## p_private label  -1.136 |   0.823   3.742   0.051   3.903 |
## 
## Categorical variables (eta2)
##                   Dim.1 Dim.2 Dim.3  
## age_Q           | 0.409 0.483 0.373 |
## sex             | 0.217 0.295 0.079 |
## price           | 0.112 0.133 0.544 |
## sugar           | 0.509 0.001 0.000 |
## frequency       | 0.234 0.273 0.190 |
## relaxing        | 0.137 0.274 0.080 |
# Visualizing the MCA
plot(mca, invisible=c("ind"), habillage = "quali")

The horizontal dimension explains 10.79% of the variance in the data, whereas the vertical dimension explains only 9.73%. Together the two dimensions explain only around 20% of the variance in the data set.

The MCA graph presents the information in a condensed form, where the distance between variable categories gives a measure of their similarity.
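As a sketch of how these similarities can also be inspected numerically (assuming the FactoMineR package is installed; its bundled `tea` data is the same survey), the category coordinates behind the graph can be read directly from the MCA result:

```r
library(FactoMineR)  # provides MCA() and the 'tea' example data
data(tea)

# A small illustrative subset of the survey variables
tea_sel <- tea[, c("age_Q", "sex", "sugar", "relaxing")]
res <- MCA(tea_sel, graph = FALSE)

# One row per category; categories whose coordinates lie close together
# on the first dimensions are chosen by similar groups of respondents
round(res$var$coord[, 1:2], 2)
```

The plotted distances are computed from exactly these coordinates, so sorting them can help when the biplot is too crowded to read.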

Let’s look at some possible conclusions and see if they can be drawn from the data as well.

  1. Tea with no sugar is more chosen by people from two older age groups.
# Barplot of sugar preference by age group
ggplot(tea_selected, 
       aes(x = age_Q, 
           fill = sugar)) + 
  geom_bar(position = position_dodge(preserve = "single")) +
  labs(title = 'Sugar preference by age group') + 
  theme_minimal()

Yes, from the barplot we can see that more people in the younger age groups prefer tea with sugar, while more people in the older groups prefer tea without sugar.

  2. Females are more likely to drink tea twice a day.
# Barplot of tea drinking frequency by sex
ggplot(tea_selected, 
       aes(x = sex, 
           fill = frequency)) + 
  geom_bar(position = position_dodge(preserve = "single")) +
  labs(title = 'Tea drinking frequency: female vs. male') + 
  theme_minimal()

This conclusion is also visible in the data.

  3. Sugar is chosen by people who drink tea least often (1 to 2 times per week).
# Barplot of sugar preference by tea drinking frequency
ggplot(tea_selected, 
       aes(x = frequency, 
           fill = sugar)) + 
  geom_bar(position = position_dodge(preserve = "single")) +
  labs(title = 'Sugar preference by tea drinking frequency') + 
  theme_minimal()

It seems that this conclusion can be drawn from the data as well.


Analysis of longitudinal data

date()
## [1] "Thu May 06 22:20:22 2021"
# Loading packages
library(dplyr)
library(tidyr)
library(ggplot2)

Part 1. RATS data

# Reading the data set
RATSL <- read.csv('data/RATSL.csv')

# Looking at the structure of the data set
str(RATSL)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
# Converting variables as factors
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <-factor(RATSL$Group)
RATSL$WD <-factor(RATSL$WD)

str(RATSL)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...

Let’s look at the plots of the data.

# Draw the plot:
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

We can see that group 1 has the smallest average weight, while groups 2 and 3 have similar weights. There is a clearly visible outlier in the second group, with a much higher weight than the rest of the rats in that group. Over time, the weight of the rats increases.

# Standardizing the rats' weights
RATSL <- RATSL %>% group_by(Time) %>%
  mutate(stdweight = (Weight - mean(Weight))/sd(Weight) ) %>% ungroup()

# Looking at the standardized data
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1,...
## $ Group     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, ...
## $ WD        <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD...
## $ Weight    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 55...
## $ Time      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, ...
## $ stdweight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.881900...
# Plotting again 

ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized weight")

After standardizing the weight, the growing trend is no longer as visible in the data.

# Setting the number of times
n <- RATSL$Time %>% unique() %>% length()

# Summarizing the data with mean and standard error
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
  ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
# Looking at the data:
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 3...
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375...
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3...
# Plotting the mean profiles:
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.9,0.3)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

# Creating summary data
RATSL_2 <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
# Looking at the data
glimpse(RATSL_2)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9...
# Drawing boxplots
ggplot(RATSL_2, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
  scale_y_continuous(name = "mean(Weight)")

We see the outliers of each group in the plot. There are three outliers: for the first and third groups they are the smallest values, and for the second group, the largest value.

RATSL_2 %>%
  group_by(Group) %>%
  summarize(max(mean))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##   Group `max(mean)`
##   <fct>       <dbl>
## 1 1            276.
## 2 2            594 
## 3 3            542.
RATSL_2 %>%
  group_by(Group) %>%
  summarize(min(mean))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##   Group `min(mean)`
##   <fct>       <dbl>
## 1 1            239.
## 2 2            444.
## 3 3            495.

So the outlier value is 238.9 for the first group, 495.2 for the third group, and 594 for the second group.

# Creating a new data by filtering the outliers
RATSL_3 <- RATSL_2 %>%
  filter(mean !=238.9 & mean != 495.2 & mean != 594)
# Drawing a boxplot
ggplot(RATSL_3, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
  scale_y_continuous(name = "mean(Weight)")

Now the data is free of outliers.
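Filtering on hard-coded values is fragile: it relies on exact floating-point comparison and breaks if the data change. A minimal sketch on hypothetical summary values (mimicking one mean weight per rat) removes, per group, the observation farthest from the group median instead:

```r
# Hypothetical per-rat mean weights in three groups (illustrative values only)
d <- data.frame(
  Group = factor(rep(1:3, times = c(8, 4, 4))),
  mean  = c(263, 239, 262, 267, 271, 276, 275, 268,
            444, 459, 503, 594,
            525, 530, 495, 538)
)

# Absolute deviation of each observation from its group median
dev <- abs(d$mean - ave(d$mean, d$Group, FUN = median))

# Flag, per group, the observation farthest from the median
is_outlier <- dev == ave(dev, d$Group, FUN = max)

d_clean <- d[!is_outlier, ]
```

This drops exactly one extreme observation per group, whichever side of the distribution it falls on, without naming the values explicitly.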

Let’s perform a t-test to see if groups 1 and 2 differ.

# Filtering data to have group 1 and group 2
RATSL_2groups <- filter(RATSL_3,(Group==1 | Group==2))

# Performing a two-sample t-test
t.test(mean ~ Group, data = RATSL_2groups, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -44.305, df = 8, p-value = 7.434e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -193.2012 -174.0845
## sample estimates:
## mean in group 1 mean in group 2 
##        268.7571        452.4000

Based on the result of the t-test, there is a significant difference between the two group means at the 95% confidence level. As the p-value is lower than 0.05, the null hypothesis (that the means are equal) is rejected.
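The test above assumes equal variances in the two groups (`var.equal = TRUE`). A minimal sketch on hypothetical group means shows R's default Welch variant, which drops that assumption:

```r
# Hypothetical group means on the same scale as the rat summary data
g1 <- c(263, 262, 267, 271, 276, 275, 268)
g2 <- c(444, 459, 503)

# t.test() defaults to var.equal = FALSE, i.e. Welch's t-test,
# which does not assume equal variances in the two groups
welch <- t.test(g1, g2)
welch$p.value
```

When the group variances are clearly unequal, as here, Welch's version is generally the safer default.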

# Adding the baseline from the original data as a new variable to the summary data

RATS <- read.csv('data/RATS.csv')

RATSL_4 <- RATSL_2 %>%
  mutate(baseline = RATS$WD1)

# Fitting the linear model
fit <- lm(mean ~ baseline + Group, data = RATSL_4)

# Computing the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table shows that the baseline weight is highly significant (p < 0.001), whereas the group effect, after conditioning on the baseline, is not significant at the 5% level (p ≈ 0.076). This suggests that the group differences seen earlier are largely explained by the rats' starting weights.

Part 2. BPRS data

# Reading the data set
BPRSL <- read.csv('data/BPRSL.csv')

# Looking at the structure of the data set
str(BPRSL)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
# Converting variables as factors
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)